Sample: s = select.books(), which needs to be preceded by fetchData("simulate.r")
## Retrieving from
## http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
fetchData("mLM.R")
mLM( wage ~ sex + sector, data=CPS85 )
The variance (var()) provides a nice way to do this because it has a partitioning property. The variance of the residuals plus the variance of the fitted values equals the variance of the measured values.
This is much like the Pythagorean theorem.
It means we can measure “how much” variation has been “explained” by the model, and how much remains unexplained.
Try it on any grouping you like.
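The partitioning can be checked numerically. What follows is a minimal sketch, assuming base R's lm() and the built-in mtcars data as stand-ins; the notes themselves work with mm() and CPS85, so the variable names here are illustrative only.

```r
# Stand-in example: base R's lm() on the built-in mtcars data,
# modeling miles per gallon by number of cylinders (a grouping variable).
mod <- lm(mpg ~ factor(cyl), data = mtcars)

v_fitted <- var(fitted(mod))   # variation "explained" by the model
v_resid  <- var(resid(mod))    # variation left unexplained
v_total  <- var(mtcars$mpg)    # variation in the measured values

# The two pieces add up to the total (up to floating-point error)
all.equal(v_fitted + v_resid, v_total)

# Fraction of the variation explained; for a least-squares fit
# with an intercept this matches R-squared
v_fitted / v_total
```

Swapping in a different grouping variable on the right-hand side of the formula changes how the total splits, but not the fact that the two pieces sum to the total.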
ACTIVITY: Construct a few models of the wage variable in the CPS85 data set, using different explanatory variables.
Accuracy versus Precision
The dictionary gets it substantially wrong:
precision |priˈsi zh ən| noun the quality, condition, or fact of being exact and accurate: the deal was planned and executed with military precision. • [as adj.] marked by or adapted for accuracy and exactness: a precision instrument. • technical refinement in a measurement, calculation, or specification, esp. as represented by the number of digits given: this has brought an unprecedented degree of precision to the business of dating rocks | a precision of six decimal figures. Compare with accuracy. ORIGIN mid 18th cent.: from French précision or Latin praecisio(n-), from praecidere ‘cut off’ (see precise).
Thesaurus synonyms: tools crafted with precision: exactness, exactitude, accuracy, correctness, preciseness; care, carefulness, meticulousness, scrupulousness, punctiliousness, methodicalness, rigor, rigorousness.
Today is all about Precision and how to measure and convey it. For us, the meaning of precision is “the degree of repeatability” or “reproducibility.” Sometimes this is conveyed by the “number of digits”, which is to say “the position of the last digit that matters”.
One Wikipedia definition is useless: precision (statistics), http://en.wikipedia.org/wiki/Precision_(statistics).
Another one, accuracy and precision, is more on target.
We saw that the choice of parameters in a model could be automated: fit the model, thereby choosing the parameters that eliminate bias in the residuals and minimize the variance.
Of course, the chosen parameters depend on the data to which the model is fitted. Insofar as the data are a random sample from a population, the parameters themselves are random. The question for us today is, “How random?”
Sample 100 runners from the runners data set and find the mean running time. Because there is some missing data, you'll need to do something like this:
options(na.rm=TRUE)
run = fetchData("repeat-runners.csv")
## Retrieving from
## http://www.mosaic-web.org/go/datasets/repeat-runners.csv
nrow(run)
## [1] 24334
mysamp = sample(run,size=100)
mymod = mm( gun ~ sex, data=mysamp)
mymod
##
## Groupwise Model Call:
## gun ~ sex
##
## Coefficients:
## F M
## 96.2 81.7
Compare results across the class to see what range of values the different students get.
Do it several times on the instructor's computer.
do(5) * mm( gun ~ sex, data=sample(run,size=100))
##        F     M sigma r.squared
## 1 101.31    NA    NA        NA
## 2  94.61 87.16 14.53    0.0629
## 3  97.95 86.06 12.70    0.1743
## 4  98.78 83.92 15.72    0.1840
## 5 100.27 84.55 14.02    0.2259
Do it many times with do and plot out the sampling distribution.
trials = do(100) * mm( gun ~ sex, data=sample(run,size=100))
densityplot( ~F, data=trials )
densityplot( ~M, data=trials )
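For readers without the mosaic helpers, the same repeat-and-collect idea can be sketched in base R. This is only an illustration: the population below is simulated with made-up means and standard deviation (it is not the runners data), replicate() stands in for do(), and tapply() stands in for mm().

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical population: made-up times for two groups, NOT the runners data
pop <- data.frame(
  sex = factor(rep(c("F", "M"), each = 5000)),
  gun = c(rnorm(5000, mean = 98, sd = 15), rnorm(5000, mean = 85, sd = 15))
)

# One trial: draw n rows at random and compute the groupwise means
one_trial <- function(n = 100) {
  samp <- pop[sample(nrow(pop), n), ]
  tapply(samp$gun, samp$sex, mean)
}

# Repeat the trial 100 times; each row of trials_sim is one trial,
# with one column per group (F and M)
trials_sim <- as.data.frame(t(replicate(100, one_trial())))

# Sampling distribution of the group mean for F
plot(density(trials_sim$F), main = "Sampling distribution of the mean (F)")
```

Each row of trials_sim plays the role of one row of trials above: one random sample, summarized by its groupwise means.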
Or, to put them both on one plot, which I want to do to compare to the population distribution, here's a more esoteric command:
require(reshape2)
## Loading required package: reshape2
densityplot( ~value, groups=variable, data=melt(trials[,1:2]))
## Using as id variables
How do we measure the width of the sampling distribution? With the standard deviation of the distribution.
# sd() needs a single numeric column, not the whole data frame
sd(trials$F, na.rm=TRUE)
sd(trials$M, na.rm=TRUE)
This standard deviation of the sampling distribution is called the standard error.
IMPORTANT: the standard error, which describes the width of the sampling distribution, is not at all the same as the spread of the population distribution. Here's the distribution of individual running times in the population:
densityplot( ~gun, groups=sex, data=run)
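The contrast between the two widths can be made concrete with a simulated population (made-up numbers, not the runners data). For a sample mean, a standard result is that the standard error is roughly the population standard deviation divided by the square root of the sample size, so with n = 100 the sampling distribution should be about ten times narrower than the population. A minimal sketch under those assumptions:

```r
set.seed(2)  # arbitrary seed so the sketch is repeatable

# Hypothetical population of times (simulated, NOT the runners data)
pop_times <- rnorm(20000, mean = 90, sd = 15)
n <- 100

# Sampling distribution: the mean of each of 1000 random samples of size n
sample_means <- replicate(1000, mean(sample(pop_times, n)))

sd(pop_times)             # spread of individuals in the population
sd(sample_means)          # standard error: spread of the sample means
sd(pop_times) / sqrt(n)   # theoretical value, close to the line above
```

The first number describes how much individuals differ from one another; the second describes how much the estimated mean would change from one sample of 100 to the next.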